This section performs spacial and statistical analysis on the data of
Manhattan. In the analysis, the gross_income_per_sq_ft is
treated as the indicator of rental price. Since the COVID-19 pandemic
had great influence on the property market, we mainly focus on reports
after 2019.
From the prospective of tenant rather than property owner, people cares about how to make the most cost-effective choice and rent a satisfactory room with lowest price. This page are intended to provide helpful information.
# load packages and data
library(tidyverse)
library(plotly)
load("E:/2023_autumn/DataScience/Final_Proj/data/cleaned_data.RData")
load("E:/2023_autumn/DataScience/Final_Proj/data/Manhattan_latlonmap.RData")
transformed_rental_income=
transformed_rental_income %>%
separate(boro_block_lot, into = c("borough", "block", "lot"), sep = "-") %>%
mutate(
borough = case_when(
borough == "1" ~ "Manhattan",
borough == "2" ~ "Bronx",
borough == "3" ~ "Brooklyn",
borough == "4" ~ "Queens",
borough == "5" ~ "Staten Island")
) %>%
filter(
borough == "Manhattan"
)
Manhattan_data = Manhattan_data %>%
separate(address, into = c("address", "toremove"), sep = ",") %>%
select(-toremove)
transformed_rental_income =
left_join(transformed_rental_income, Manhattan_data, by = "address")
## Columbia university lat lon 40.80754 -73.96257
## Hammer health science building lat lon 40.84252 -73.94243
for_map = transformed_rental_income %>%
filter(report_year > 2019) %>%
select(-c(latitude,longitude))
income_sqft_M = for_map %>%
plot_mapbox(
lon = ~lon,
lat = ~lat,
color = ~gross_income_per_sq_ft,
mode = 'markers',
alpha = 0.5,
hovertext = ~gross_income_per_sq_ft,
hoverinfo = 'text',
colors = "Spectral"
) %>%
layout(
mapbox = list(
style = "streets",
zoom = 10,
center = list(lat = 40.81, lon = -73.96)
)
)
yearsbuild_M = for_map %>%
plot_mapbox(
lon = ~lon,
lat = ~lat,
color = ~year_built,
mode = 'markers',
alpha = 0.5,
hovertext = ~year_built,
hoverinfo = 'text',
colors = c('red','yellow','green')
) %>%
layout(
mapbox = list(
style = "streets",
zoom = 10,
center = list(lat = 40.81, lon = -73.96)
)
)
elevator_M = for_map %>%
plot_mapbox(
lon = ~lon,
lat = ~lat,
color = ~building_classification,
mode = 'markers',
alpha = 1,
hovertext = ~building_classification,
hoverinfo = 'text'
) %>%
layout(
mapbox = list(
style = "streets",
zoom = 10,
center = list(lat = 40.81, lon = -73.96)
)
)
subplot(income_sqft_M, yearsbuild_M,elevator_M, nrows = 1, margin = 0.01) %>% layout(showlegend = FALSE, title = 'Income/sqft - Year built - Classification')
When considering rent an apartment, people may refers to the location, building’s lifespan, elevator facilities and rental price. The figure above plotted the Income/sqft, Year built and Elevator info according to building’s location. The three plots are in same size and version, providing a comparable overview.
income_sqft_M
The rental price showed great difference from the north edge to south edge of Manhattan. Rentals at upper town that close to Bronx are the cheapest. As the street numbers get lower, the rental price increase and reaches its peak at west village. There is less difference between the rents at west and east side of Manhattan. But specially, rents at the lowers east side are significantly lower than surrondings.
yearsbuild_M
Generally, the building at west side are older than those at east side of Manhattan. There are more new buildings in midtown and lower Manhattan than in uptown Manhattan.
elevator_M
Most of the buildings in Manhattan are equipped with Elevator, while there are still groups of walk-up buildings at uptown near Bronx, midtown west and lower Manhattan. This may closly related to their year built.
Then we are curious about whether the factors(Years built and Elevator Facilities) are truly related to the rental price. We choose a hypothesis test approach to explore their relationship.
for_map %>%
group_by(building_classification) %>%
summarise(
n = n(),
mean = mean(gross_income_per_sq_ft),
sum = sum(gross_income_per_sq_ft),
variance = var(gross_income_per_sq_ft)
)
## # A tibble: 17 × 5
## building_classification n mean sum variance
## <chr> <int> <dbl> <dbl> <dbl>
## 1 C1-WALK-UP 1476 33.2 49058. 196.
## 2 C4-WALK-UP 1195 40.8 48767. 144.
## 3 C5-WALK-UP 366 46.6 17051. 77.3
## 4 C6-WALK-UP 1410 37.7 53185. 183.
## 5 C7-WALK-UP 960 37.6 36140. 228.
## 6 C8-WALK-UP 1 38.0 38.0 NA
## 7 C9-WALK-UP 1 55.4 55.4 NA
## 8 D0-ELEVATOR 265 46.3 12261. 35.4
## 9 D1-ELEVATOR 2553 36.0 91896. 148.
## 10 D2-ELEVATOR 65 36.0 2342. 69.3
## 11 D3-ELEVATOR 1861 38.7 71938. 81.7
## 12 D4-ELEVATOR 3160 40.4 127553. 110.
## 13 D5-ELEVATOR 463 45.8 21203. 49.0
## 14 D6-ELEVATOR 2724 45.9 125110. 62.5
## 15 D7-ELEVATOR 1070 39.4 42170 141.
## 16 D8-ELEVATOR 307 45.5 13958. 59.9
## 17 D9-ELEVATOR 619 43.3 26800. 95.9
for_map %>%
ggplot(aes(x =building_classification, y = gross_income_per_sq_ft,fill =building_classification))+
geom_boxplot()+
theme(
axis.text.x = element_text(angle = 90)
)+
guides(fill=F)

According to the box plot, we have few records on
C8-WALK-UP and C9-WALK-UP, so we filtered them
out in the anova test.
for_anova = for_map %>%
filter(
building_classification!= "C8-WALK-UP",
building_classification!= "C9-WALK-UP"
)%>%
select(c(building_classification,gross_income_per_sq_ft))
res = lm(gross_income_per_sq_ft ~ factor(building_classification), data = for_anova)
anova(res)
## Analysis of Variance Table
##
## Response: gross_income_per_sq_ft
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(building_classification) 14 280558 20039.9 162.64 < 2.2e-16 ***
## Residuals 18479 2276904 123.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this one way anova test, our null hypothesis is that the average price of each kinds of building are the same. Given a p-value much smaller than 0.05, we reject the null hypothesis and conclude that there are differences in the average price of each kinds of building. This indicates that the building classification is related to the rental price.
The detailed relationship across each types can be further checked by paired t-test.
for_map %>%
ggplot(aes(x = gross_income_per_sq_ft))+
geom_histogram()

fit1 = lm(gross_income_per_sq_ft ~ year_built, data = for_map)
summary(fit1)
##
## Call:
## lm(formula = gross_income_per_sq_ft ~ year_built, data = for_map)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.293 -6.901 2.249 7.864 42.745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.301489 6.192131 -3.602 0.000317 ***
## year_built 0.032274 0.003208 10.060 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.73 on 18494 degrees of freedom
## Multiple R-squared: 0.005442, Adjusted R-squared: 0.005388
## F-statistic: 101.2 on 1 and 18494 DF, p-value: < 2.2e-16
plot(fit1,which = 1)

plot(fit1,which = 2)
We applied SLR to test the significance of the years built. Given the
null hypothesis that the coefficient of years built term is 0, p-value
is small and we can reject the null. However, the estimated coefficient
is small and the price seems not normally distributed, which may lead to
inaccurate result.
What about the price near Columbia university and Columbia university Irving medical center?
latCU = 40.80754
lonCU = -73.96257
latCUIMC = 40.84252
lonCUIMC = -73.94243
near_columbia = for_map %>%
filter(
lat > latCU - 0.01,
lat < latCU + 0.01,
lon > lonCU -0.005,
lon < lonCU + 0.005
) %>%
plot_mapbox(
lon = ~lon,
lat = ~lat,
color = ~gross_income_per_sq_ft,
mode = 'markers',
alpha = 1,
hovertext = ~address,
text = ~gross_income_per_sq_ft,
meta = ~building_classification,
hovertemplate = "%{hovertext} <br> %{meta} <br>gross_income_per_sq_ft: %{text}",
hoverinfo = 'text',
colors = "Spectral"
) %>%
layout(
mapbox = list(
style = "streets",
zoom = 13,
center = list(lat = latCU, lon = lonCU)
)
)
near_CUIMC = for_map %>%
filter(
lat > latCUIMC - 0.01,
lat < latCUIMC + 0.01,
lon > lonCUIMC -0.005,
lon < lonCUIMC + 0.005
)%>%
plot_mapbox(
lon = ~lon,
lat = ~lat,
color = ~gross_income_per_sq_ft,
mode = 'markers',
alpha = 1,
hovertext = ~address,
text = ~gross_income_per_sq_ft,
meta = ~building_classification,
hovertemplate = "%{hovertext} <br> %{meta} <br>gross_income_per_sq_ft: %{text}",
hoverinfo = 'text',
colors = "Spectral"
) %>%
layout(
mapbox = list(
style = "streets",
zoom = 13,
center = list(lat = latCUIMC, lon = lonCUIMC)
)
)
subplot(near_columbia, near_CUIMC, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.01)